
    Parameterized Indexed Value Function for Efficient Exploration in Reinforcement Learning

    It is well known that quantifying uncertainty in the action-value estimates is crucial for efficient exploration in reinforcement learning. Ensemble sampling offers a relatively computationally tractable way of doing this using randomized value functions. However, it still requires a huge amount of computational resources for complex problems. In this paper, we present an alternative, computationally efficient way to induce exploration using index sampling. We use an indexed value function to represent uncertainty in our action-value estimates. We first present an algorithm to learn a parameterized indexed value function through a distributional version of temporal-difference learning in a tabular setting and prove its regret bound. Then, from a computational point of view, we propose a dual-network architecture, Parameterized Indexed Networks (PINs), comprising one mean network and one uncertainty network, to learn the indexed value function. Finally, we show the efficacy of PINs through computational experiments.
    Comment: 17 pages, 4 figures, Proceedings of the 34th AAAI Conference on Artificial Intelligence.
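
    The dual-network decomposition lends itself to a compact illustration. Below is a minimal sketch of an indexed action-value function of the form Q_z(s, a) = mean(s, a) + z * uncertainty(s, a), with one index z drawn per episode; the tabular shapes and the standard-normal index distribution are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class ParameterizedIndexedValue:
    """Toy indexed action-value: Q_z(s, a) = mean(s, a) + z * uncertainty(s, a).

    The mean and uncertainty tables stand in for the paper's two networks;
    a single index z drawn once per episode induces exploration.
    """

    def __init__(self, n_states, n_actions):
        self.mean = np.zeros((n_states, n_actions))        # "mean network"
        self.uncertainty = np.ones((n_states, n_actions))  # "uncertainty network"

    def sample_index(self):
        # One shared index per episode (assumed standard normal).
        return rng.standard_normal()

    def q_values(self, state, z):
        # Shifting the mean by z times the uncertainty yields a
        # randomized value function indexed by z.
        return self.mean[state] + z * self.uncertainty[state]

    def act(self, state, z):
        return int(np.argmax(self.q_values(state, z)))

# Usage: draw one index per episode, then act greedily w.r.t. the indexed values.
q = ParameterizedIndexedValue(n_states=10, n_actions=4)
z = q.sample_index()
action = q.act(state=3, z=z)
```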

    Sample-Efficient Multi-Agent RL: An Optimization Perspective

    We study multi-agent reinforcement learning (MARL) for general-sum Markov games (MGs) under general function approximation. In order to find the minimal assumption for sample-efficient learning, we introduce a novel complexity measure called the Multi-Agent Decoupling Coefficient (MADC) for general-sum MGs. Using this measure, we propose the first unified algorithmic framework that ensures sample efficiency in learning Nash equilibria, coarse correlated equilibria, and correlated equilibria for both model-based and model-free MARL problems with low MADC. We also show that our algorithm achieves sublinear regret comparable to existing works. Moreover, our algorithm combines an equilibrium-solving oracle with a single-objective optimization subprocedure that solves for the regularized payoff of each deterministic joint policy, which avoids solving constrained optimization problems within data-dependent constraints (Jin et al. 2020; Wang et al. 2023) or executing sampling procedures with complex multi-objective optimization problems (Foster et al. 2023), thus being more amenable to empirical implementation.
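
    As a rough illustration of the single-objective subprocedure mentioned above, the sketch below scores each deterministic joint policy in a toy two-player matrix game by its payoff plus a count-based regularization bonus; the payoff matrices, visit counts, and the form of the regularizer are all invented here, and the paper's data-dependent term differs.

```python
import numpy as np

# Toy two-player general-sum matrix game: payoffs[i, a1, a2] is player i's
# payoff for the deterministic joint policy (a1, a2). All numbers are invented.
payoffs = np.array([[[1.0, 0.0], [0.0, 1.0]],
                    [[0.5, 1.0], [1.0, 0.0]]])
visit_counts = np.array([[5, 1], [2, 8]])  # hypothetical visitation data

def regularized_payoff(player, a1, a2, lam=0.5):
    # Count-based bonus standing in for the paper's data-dependent
    # regularization term (the exact form differs).
    bonus = 1.0 / np.sqrt(visit_counts[a1, a2])
    return payoffs[player, a1, a2] + lam * bonus

# One unconstrained scalar objective per deterministic joint policy, ready to
# hand to an equilibrium-solving oracle.
scores = {(a1, a2): regularized_payoff(0, a1, a2)
          for a1 in range(2) for a2 in range(2)}
```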

    Learning in Congestion Games with Bandit Feedback

    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism-in-the-face-of-uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show that the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not on the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not on the size of the action set.
    Comment: 34 pages, Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022).
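
    To make the structure concrete, here is a minimal sketch of one round of a congestion game with semi-bandit feedback: each player picks a subset of facilities, facility costs depend only on their loads, and each player observes the cost of every facility it used. The cost function and problem sizes are illustrative assumptions; note that the action set (subsets of facilities) is exponential in the number of facilities, which is why the polynomial sample-complexity dependence matters.

```python
import numpy as np

n_players, n_facilities = 3, 4

def facility_cost(facility, load):
    # Assumed congestion cost: grows with how many players use the facility.
    return (1 + facility) * load

def play(actions):
    """One round: actions[p] is the set of facilities chosen by player p.

    Semi-bandit feedback: each player observes the realized cost of every
    facility it used, not just the total cost of its action.
    """
    assert len(actions) == n_players
    loads = np.zeros(n_facilities, dtype=int)
    for chosen in actions:
        for f in chosen:
            loads[f] += 1
    return [{f: facility_cost(f, loads[f]) for f in chosen} for chosen in actions]

# Each action is a subset of facilities, so the action set has up to
# 2**n_facilities elements -- exponential in the number of facilities.
feedback = play([{0, 1}, {1, 2}, {1, 3}])
```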

    A/B Testing and Best-arm Identification for Linear Bandits with Robustness to Non-stationarity

    We investigate the fixed-budget best-arm identification (BAI) problem for linear bandits in a potentially non-stationary environment. Given a finite arm set $\mathcal{X}\subset\mathbb{R}^d$, a fixed budget $T$, and an unpredictable sequence of parameters $\{\theta_t\}_{t=1}^{T}$, an algorithm will aim to correctly identify the best arm $x^* := \arg\max_{x\in\mathcal{X}} x^\top\sum_{t=1}^{T}\theta_t$ with probability as high as possible. Prior work has addressed the stationary setting where $\theta_t = \theta_1$ for all $t$ and demonstrated that the error probability decreases as $\exp(-T/\rho^*)$ for a problem-dependent constant $\rho^*$. But in many real-world $A/B/n$ multivariate testing scenarios that motivate our work, the environment is non-stationary and an algorithm expecting a stationary setting can easily fail. For robust identification, it is well known that if arms are chosen randomly and non-adaptively from a G-optimal design over $\mathcal{X}$ at each time, then the error probability decreases as $\exp(-T\Delta_{(1)}^2/d)$, where $\Delta_{(1)} = \min_{x \neq x^*} (x^* - x)^\top \frac{1}{T}\sum_{t=1}^T \theta_t$. As there exist environments where $\Delta_{(1)}^2/d \ll 1/\rho^*$, we are motivated to propose a novel algorithm $\mathsf{P1}$-$\mathsf{RAGE}$ that aims to obtain the best of both worlds: robustness to non-stationarity and fast rates of identification in benign settings. We characterize the error probability of $\mathsf{P1}$-$\mathsf{RAGE}$ and demonstrate empirically that the algorithm indeed never performs worse than G-optimal design but compares favorably to the best algorithms in the stationary setting.
    Comment: 25 pages, 6 figures.
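
    The quantities in the abstract are straightforward to compute for a concrete instance. The sketch below builds a toy non-stationary problem, forms the time-averaged parameter, and evaluates the best arm and the robust gap $\Delta_{(1)}$ that drives the G-optimal error rate; the arm set and parameter sequence are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 3, 100
X = rng.standard_normal((8, d))  # finite arm set: 8 arms in R^3 (invented)
# Drifting parameter sequence around a fixed direction (invented).
thetas = 0.1 * rng.standard_normal((T, d)) + np.array([1.0, 0.0, 0.0])

theta_bar = thetas.mean(axis=0)          # (1/T) * sum_t theta_t
rewards = X @ theta_bar
best = int(np.argmax(rewards))           # x* = argmax_x x^T sum_t theta_t

# Robust gap Delta_(1): margin of x* over the runner-up under theta_bar.
delta_1 = (rewards[best] - np.delete(rewards, best)).min()

# Non-adaptive sampling from a G-optimal design drives the error probability
# down like exp(-T * Delta_(1)^2 / d), per the abstract.
g_optimal_error_bound = np.exp(-T * delta_1**2 / d)
```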

    One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration

    In online reinforcement learning (online RL), balancing exploration and exploitation is crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration. However, in order to cope with general function approximators, most of them involve impractical algorithmic components to incentivize exploration, such as optimization within data-dependent level sets or complicated sampling procedures. To address this challenge, we propose an easy-to-implement RL framework called Maximize to Explore (MEX), which only needs to optimize a single unconstrained objective that integrates the estimation and planning components while balancing exploration and exploitation automatically. Theoretically, we prove that MEX achieves a sublinear regret with general function approximations for Markov decision processes (MDPs) and is further extendable to two-player zero-sum Markov games (MGs). Meanwhile, we adapt deep RL baselines to design practical versions of MEX, in both model-free and model-based manners, which can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards. Compared with existing sample-efficient online RL algorithms with general function approximations, MEX achieves similar sample efficiency while enjoying a lower computational cost and being more compatible with modern deep RL methods.
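
    Schematically, the single objective picks the hypothesis that maximizes its own optimal value minus a weighted data-fitting loss. The sketch below shows that selection rule; the hypothesis class, the value and loss functions, and the trade-off coefficient eta are placeholders, not the paper's exact instantiation.

```python
import numpy as np

def mex_select(hypotheses, optimal_value, fit_loss, eta=0.1):
    """Maximize-to-Explore style selection (schematic).

    hypotheses:    candidate models / value functions
    optimal_value: planning term -- value of the hypothesis' optimal policy
    fit_loss:      estimation term -- how badly the hypothesis fits the data
    One unconstrained objective trades the two off through eta.
    """
    scores = [optimal_value(h) - eta * fit_loss(h) for h in hypotheses]
    return hypotheses[int(np.argmax(scores))]

# Toy usage: scalar "hypotheses" with made-up value and loss maps.
chosen = mex_select(
    hypotheses=[0.0, 0.5, 1.0],
    optimal_value=lambda h: h,          # optimistic in h
    fit_loss=lambda h: (h - 0.4) ** 2,  # data prefers h near 0.4
)
```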

    Substantial transition to clean household energy mix in rural China

    The household energy mix has significant impacts on human health and climate, as it contributes greatly to many health- and climate-relevant air pollutants. Compared to the well-established urban energy statistical system, the rural household energy statistical system is incomplete and often subject to high biases. Via a nationwide investigation, this study revealed high contributions to energy supply from coal and biomass fuels in the rural household energy sector, while electricity comprised ∼20%. Stacking (the use of multiple sources of energy) is significant, and the average number of energy types was 2.8 per household. Compared to 2012, the consumption of biomass and coal in 2017 decreased by 45% and 12%, respectively, while gas consumption increased by 204%. The increase in gas and decrease in coal consumption occurred mainly in cooking, while the decrease in biomass occurred in both cooking (41%) and heating (59%). The time-sharing fraction of electricity and gases (E&G) for daily cooking grew, reaching 69% in 2017, but for space heating, traditional solid fuels were still dominant, with the national average shared fraction of E&G being only 20%. The non-uniform spatial distribution and the non-linear increase in the fraction of E&G indicate challenges to achieving universal access to modern cooking energy by 2030, particularly in less-developed rural and mountainous areas. In some non-typical heating zones, the increased share of E&G for heating was significant and largely driven by income growth, but in typical heating zones, the time-sharing fraction was <5% and did not increase significantly, except in areas with policy intervention. The intervention policy not only led to dramatic increases in the clean energy fraction for heating but also accelerated the clean cooking transition. Higher income, higher education, younger age, less energy/stove stacking, and smaller family size positively impacted the clean energy transition.

    Robust estimation of bacterial cell count from optical density

    Optical density (OD) is widely used to estimate the density of cells in liquid culture, but it cannot be compared between instruments without a standardized calibration protocol and is challenging to relate to actual cell count. We address this with an interlaboratory study comparing three simple, low-cost, and highly accessible OD calibration protocols across 244 laboratories, applied to eight strains of constitutive GFP-expressing E. coli. Based on our results, we recommend calibrating OD to estimated cell count using serial dilution of silica microspheres, which produces highly precise calibration (95.5% of residuals <1.2-fold), is easily assessed for quality control, also measures the instrument's effective linear range, and can be combined with fluorescence calibration to obtain units of Molecules of Equivalent Fluorescein (MEFL) per cell, allowing direct comparison and data fusion with flow cytometry measurements: in our study, fluorescence-per-cell measurements showed only a 1.07-fold mean difference between plate reader and flow cytometry data.
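
    The core of the microsphere protocol is a simple linear fit. Below is a minimal sketch, assuming a 2-fold serial dilution with a known starting particle count and OD readings within the instrument's linear range; all numbers are illustrative, and this is not the paper's exact fitting procedure.

```python
import numpy as np

# Two-fold serial dilution of silica microspheres with a known starting count.
starting_count = 3.0e8                      # particles per well (illustrative)
counts = starting_count / 2 ** np.arange(8)
od_blank = 0.04                             # media-only blank reading
od_readings = np.array([0.91, 0.47, 0.25, 0.14, 0.09, 0.065, 0.052, 0.046])

# Within the instrument's linear range, OD - blank ~ slope * count, so fit
# the slope by least squares through the origin.
net_od = od_readings - od_blank
slope = np.sum(net_od * counts) / np.sum(counts**2)

def od_to_count(od):
    # Convert a sample OD reading to an estimated particle-equivalent count.
    return (od - od_blank) / slope

estimated_cells = od_to_count(0.30)
```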

    Using Wasserstein GAN to generate high quality adversarial examples

    Although Deep Neural Networks (DNNs) have state-of-the-art performance in various machine learning tasks, in recent years they have been found to be vulnerable to so-called adversarial examples. Specifically, take x ∈ D on which a neural network has very high classification accuracy. It is possible to find some small perturbation Δx so that even though the difference between x and x′ = x + Δx is almost imperceptible to humans, the given neural network is very likely to incorrectly classify x′. Several gradient- and optimization-based methods have been proposed to create such adversarial examples x′, but many of them cannot achieve high speed and high quality of x′ simultaneously. In this thesis, we propose a new algorithm to generate adversarial examples based on Generative Adversarial Networks (GANs); specifically, a modification to the training algorithm of the Improved Wasserstein GAN. The trained generator is able to create x′ very similar to the original x while keeping the classification accuracy of the target model as low as the state-of-the-art attack. Furthermore, although training a GAN might be slow, after it is trained it can generate adversarial examples much faster than previous optimization-based methods. Our goal is for this work to be used for further research on robust neural networks.
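
    A schematic of the kind of objective such a generator can be trained against: stay realistic according to a Wasserstein critic while driving the target classifier away from the true label. The function below is a toy stand-in; the thesis' actual loss, networks, and training loop are not reproduced here.

```python
import numpy as np

def generator_loss(x, perturb, critic, classifier, true_label, c=1.0):
    """Schematic adversarial-example objective for a WGAN-style generator.

    perturb:    generator output Δx for input x, so x_adv = x + Δx
    critic:     Wasserstein critic scoring realism (higher = more real)
    classifier: target model returning class probabilities
    The generator should look real to the critic while making the target
    classifier assign low probability to the true label.
    """
    x_adv = np.clip(x + perturb, 0.0, 1.0)   # keep a valid pixel range
    realism = critic(x_adv)                  # WGAN term: maximize realism
    fool = classifier(x_adv)[true_label]     # probability of the true class
    return -realism + c * fool               # minimize both terms jointly

# Toy usage with stand-in critic/classifier on a 4-pixel "image".
x = np.array([0.2, 0.8, 0.5, 0.1])
loss = generator_loss(
    x, perturb=0.01 * np.ones(4),
    critic=lambda z: -np.abs(z - 0.5).sum(),
    classifier=lambda z: np.full(10, 0.1),
    true_label=3,
)
```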

    Near-Optimal Randomized Exploration for Tabular MDP

    We study exploration using randomized value functions in Thompson Sampling (TS)-like algorithms in reinforcement learning. This class of algorithms enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}(H\sqrt{SAT})$ regret bound for episodic time-inhomogeneous Markov decision processes, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $T$ is the number of interactions. This bound polynomially improves all existing bounds for TS-like algorithms based on randomized value functions and, for the first time, matches the $\Omega(H\sqrt{SAT})$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure that both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursion formula for the absolute value of estimation errors to analyze the regret.
    Comment: 42 pages.
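
    The two ingredients are easy to mock up. The sketch below perturbs empirical Q-estimates with noise whose standard deviation has a Bernstein-type form (a variance term plus a lower-order term), drawn from a single per-episode seed, and clips to the trivially valid range [0, H]; the constants, the exact noise construction, and the paper's clipping operation all differ in detail.

```python
import numpy as np

def perturbed_q(q_hat, var_hat, counts, episode_seed, H, c1=1.0, c2=1.0):
    """Randomized value function with Bernstein-type noise (schematic).

    q_hat:   empirical Q estimates, shape (S, A)
    var_hat: empirical variance of the backed-up values, shape (S, A)
    counts:  visit counts n(s, a), shape (S, A)
    A single seed per episode generates all the noise for that episode.
    """
    rng = np.random.default_rng(episode_seed)
    n = np.maximum(counts, 1)
    # Bernstein-style magnitude: a variance term plus a lower-order H/n term.
    sigma = c1 * np.sqrt(var_hat / n) + c2 * H / n
    noise = sigma * rng.standard_normal(q_hat.shape)
    # Clip to the trivially valid value range [0, H]; the paper's clipping
    # operation is more careful and underlies the constant-probability
    # optimism/pessimism guarantees.
    return np.clip(q_hat + noise, 0.0, H)

# Toy usage for a 4-state, 2-action stage with horizon H = 5.
q = perturbed_q(np.zeros((4, 2)), np.ones((4, 2)),
                counts=np.full((4, 2), 10), episode_seed=7, H=5)
```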